The Network#

import numpy as np
import pickle
import networkx as nx
import pandas as pd
import holoviews as hv
import geoviews as gv
from colorcet import palette
import bokeh
import warnings
import matplotlib.pyplot as plt
import matplotlib.lines as mlines

from bokeh.sampledata.us_states import data as us_states
from mpl_toolkits.basemap import Basemap as Basemap
from holoviews.operation.datashader import datashade, directly_connect_edges
from shapely.errors import ShapelyDeprecationWarning

warnings.filterwarnings("ignore", category=ShapelyDeprecationWarning) 
hv.extension('bokeh')

The dataset used for the project was created from a larger one collected by researchers from ELTE, available on the Kooplex platform.

Kooplex#

The main task here was to extract the users whose main geographical location was within the United States of America, and then to extract the followers of these users, using SQL queries. Two tables were used: Twitter.dbo.user_location_cluster and Twitter.dbo.user_follower. The first is queried to get the geolocation of a Twitter user as latitude and longitude; the second is used to extract the followers of the previously selected users.

To choose only Twitter users from the United States, the users needed to be bounded in a box covering approximately the area of the U.S.A.; I got the coordinates from this site.

min latitude    max latitude    min longitude    max longitude
24.9493         49.5904         -125.0011        -66.9326
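The bounding-box test itself is simple enough to sketch in plain Python. This is only an illustration of the filter, not the SQL query that was actually run on Kooplex; the function name `in_us_bounding_box` and the sample rows are hypothetical.

```python
# Bounding-box limits quoted above, in degrees.
MIN_LAT, MAX_LAT = 24.9493, 49.5904
MIN_LON, MAX_LON = -125.0011, -66.9326


def in_us_bounding_box(lat, lon):
    """Return True if the point lies inside the box (borders included)."""
    return MIN_LAT <= lat <= MAX_LAT and MIN_LON <= lon <= MAX_LON


# Hypothetical sample rows: (user_id, lat, lon)
sample_users = [
    (1, 40.7128, -74.0060),   # New York
    (2, 47.4979, 19.0402),    # Budapest
    (3, 34.0522, -118.2437),  # Los Angeles
]

# Keep only the ids of users whose location falls inside the box
us_user_ids = [uid for uid, lat, lon in sample_users
               if in_us_bounding_box(lat, lon)]
print(us_user_ids)  # [1, 3]
```

Because the box is only an approximation of the country's outline, it also lets through some locations just across the border (for example, Toronto at roughly 43.65, -79.38 falls inside it).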

After extracting the user_id-s and follower_user_id-s, I filtered the followers to contain only users from the U.S. as well. The number of users in the U.S.A. is 7198227 and the number of edges (follows) is 34913927, but I had to reduce the edges to only the bidirectional ones, meaning instances where two users follow each other mutually. The number of these edges is 22201314 and the number of users in this network is 2795066. These are our final numbers.
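The reduction to bidirectional edges can be sketched with a set lookup. This is a minimal plain-Python illustration, not the code used to produce the numbers above; the helper `mutual_edges`, the pair orientation, and the toy data are assumptions. In this sketch each mutual pair is kept in both directions, which may or may not match how the edges were counted in the original files.

```python
def mutual_edges(edges):
    """Keep only the (a, b) pairs for which the reverse (b, a) is also present."""
    edge_set = set(edges)
    return {(a, b) for (a, b) in edge_set if a != b and (b, a) in edge_set}


# Hypothetical directed follow pairs
follows = [(1, 2), (2, 1), (1, 3), (3, 4), (4, 3), (5, 1)]

mutual = mutual_edges(follows)
# Users that still take part in at least one mutual edge
users_in_network = {user for edge in mutual for user in edge}

print(sorted(mutual))            # [(1, 2), (2, 1), (3, 4), (4, 3)]
print(sorted(users_in_network))  # [1, 2, 3, 4]
```

Users who only have one-directional follows (user 5 here) drop out of the network entirely, which is why the user count shrinks along with the edge count.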

edges = pd.read_csv(r'C:\Users\dajka\Documents\Egyetem\MSC\III\dsdatasci\data/bidirectional_edges.csv')
user_df = pd.read_csv(r'C:\Users\dajka\Documents\Egyetem\MSC\III\dsdatasci\data/us_users_in_network.csv')

Interactive visualization with datashader#

The first one is an interactive visualization of the network; to make it more navigable, I limited the users to the ones with the most followers.

The code is partially from this site

# Select the 150 most followed users
most_followed_df = user_df.sort_values('follower_number', ascending=False).iloc[:150]
# Keep only the users inside a rough box around North America
# (the selection must use the declared kdims, 'lat' and 'lon')
user_points = gv.Points(most_followed_df, kdims=['lon', 'lat'])\
    .select(lat=(20, 70), lon=(-175, -50))

# Keep only the edges whose two endpoints are both among the selected users
routes = edges[edges.iloc[:, 0].isin(user_points.data.user_id) &
               edges['to'].isin(user_points.data.user_id)]

# Project the points from latitude/longitude to Web Mercator (the default target of project_points)
user_points = gv.operation.project_points(user_points)
# Declare nodes, graph and tiles
nodes = hv.Nodes(user_points.data, kdims=['lon', 'lat', 'user_id'],
                 vdims=['follower_number'])
graph = hv.Graph((routes, nodes), kdims=['from', 'to'], vdims=['from', 'to'])
tiles = gv.WMTS('https://maps.wikimedia.org/osm-intl/{Z}/{X}/{Y}@2x.png')
%%opts RGB () Graph [width=800 height=800] (edge_selection_line_color='black' edge_hover_line_color='red')
%%opts Graph (node_size=8 edge_line_alpha=0 edge_hover_line_alpha=1 edge_selection_line_alpha=1 edge_nonselection_line_alpha=0)

tiles * datashade(directly_connect_edges(graph), cmap=palette.bgy, width=800, height=800) * graph

Plot the network with Basemap#

Another visualization shows the Twitter users from the U.S., using networkx and Basemap, a Python library. The code is from: https://tuangauss.github.io/projects/networkx_basemap/networkx_basemap.html

[Figure: network]